cornelius.senf@geo.hu-berlin.de

Today's session:

  • Cluster analysis (kmeans and hierarchical)
  • Reading raster files
  • Principal component analysis

Cluster analysis

Cluster analysis

Cluster analysis assignes each data point to one of k clusters, thereby maximmizing the between-cluster variance and minimizing the within-cluster variance.

There are two principal types of cluster algorithms: partionating algorithms (e.g. kmeans) and hierarchical algorithms.

kmeans

A kmeans cluster analysis can be performed using the kmeans() function:

data(iris)

iris_kmeans <- kmeans(x = iris[, c(1:4)], centers = 3)

kmeans

## K-means clustering with 3 clusters of sizes 33, 96, 21
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.175758    3.624242     1.472727   0.2727273
## 2     6.314583    2.895833     4.973958   1.7031250
## 3     4.738095    2.904762     1.790476   0.3523810
## 
## Clustering vector:
##   [1] 1 3 3 3 1 1 1 1 3 3 1 1 3 3 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 3 1 1 1 3
##  [36] 1 1 1 3 1 1 3 3 1 1 3 1 3 1 1 2 2 2 2 2 2 2 3 2 2 3 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1]   6.432121 118.651875  17.669524
##  (between_SS / total_SS =  79.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

kmeans

We can plot the cluster solution using ggplot:

iris$cluster3 <- iris_kmeans$cluster

p <- ggplot(iris, aes(Sepal.Length, y = Sepal.Width, col = factor(cluster3))) +
  geom_point()

kmeans

kmeans

And compare it to the acutal species classification:

table(iris$cluster3, iris$Species)
##    
##     setosa versicolor virginica
##   1     33          0         0
##   2      0         46        50
##   3     17          4         0

kmeans

More/less clusters lead to different solutions:

iris$cluster2 <- kmeans(x = iris[, c(1:4)], centers = 2)$cluster
iris$cluster4 <- kmeans(x = iris[, c(1:4)], centers = 4)$cluster
iris$cluster5 <- kmeans(x = iris[, c(1:4)], centers = 5)$cluster
iris$cluster6 <- kmeans(x = iris[, c(1:4)], centers = 6)$cluster

p <- iris %>%
  gather(key = k, value = cluster, -(Sepal.Length:Species)) %>%
  ggplot(., aes(Sepal.Length, y = Sepal.Width, col = factor(cluster))) +
  geom_point() +
  facet_wrap(~k, ncol = 6)

kmeans

Hierarchical clustering

Hierarchical clustering can be performed using the hclust() function. However, we first need to calcuate the distances (similarities) between each data point pair:

dist <- dist(iris[, 1:4])

There are several methods for calculating the distance (e.g., euclidean, manhatten, …), see ?dist(). With the distance matrix we can hierarchically cluster all data points:

iris_hc <- hclust(dist)

And plot i using the ggdrndro package:

library(ggdendro)

p <- ggdendrogram(iris_hc)

See https://cran.r-project.org/web/packages/ggdendro/vignettes/ggdendro.html for further information.

Hierarchical clustering

Reading raster files

Reading raster files

For todays exercise we are going to use data that is stored in a raster stack, that is a stack of gridded rasters with matching spatial resolution and extent.

Reading a raster stack can be done using the stack() function in the raster package:

library(raster)

clim <- stack("Data/climate.envi")

Raster stacks can be of any format (geoTiff, ENVI, grid, bsq, etc.) and with or without spatial reference. For reading a single raster instead of a stack, you can use the raster() function.

Reading raster files

clim
## class       : RasterStack 
## dimensions  : 709, 1306, 925954, 4  (nrow, ncol, ncell, nlayers)
## resolution  : 0.008333333, 0.008333333  (x, y)
## extent      : 16.90833, 27.79167, 44.33333, 50.24167  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
## names       :       Band.1,       Band.2,       Band.3,       Band.4 
## min values  :   0.08333333, -52.83333333, -26.41666667,  38.41666667 
## max values  :     173.2500,      77.2500,     124.6667,     146.6667
names(clim) <- c("Tmin", "Tmean", "Tmax", "Precip")

Reading raster files

Plotting the raster is easily achieved by the plot() command. However, you have to select a layer for plotting:

plot(clim, y = 1)

Reading raster files

Reading raster files

Rasters can have millions of pixels, which can make computations very slow. To deal with this issue, we are going to sample 10,000 pixels from the raster:

clim_samp <- sampleRandom(clim, size = 10000)
clim_samp <- as.data.frame(clim_samp) # sampleRandom returns a matrix

clim_samp[1:5, ]
##       Tmin    Tmean      Tmax   Precip
## 1 134.7500 32.66667  83.41666 54.33333
## 2 158.0000 58.16667 107.75000 44.00000
## 3 113.3333 25.16667  69.00000 63.41667
## 4 141.0833 32.58333  86.66666 51.58333
## 5 118.2500 30.83333  74.33334 55.16667

Reading raster files

library(GGally)

p <- clim_samp %>%
  sample_n(., size = 100) %>%
  ggpairs(.)

Reading raster files

Principle Component Analysis

Principle Component Analysis

There are two functions in R for performaing a PCA: princomp() and prcomp(). While the former function calculates the eigenvalue on the correlation/covariance matrix using the function eigen(), the latter function uses singular value decomposition, which numerically more stable. We here are going to use prcomp().

Principle Component Analysis

pca <- prcomp(clim_samp, scale. = TRUE)

You should set the scale. argument to TRUE, to scale the data matrix to unit variances. The default is scale. = FALSE. You can also use scale() to standardize the original data matrix and then enter the standardized data matrix to prcomp.

Principle Component Analysis

The prcomp object stores the rotations (eigenvectors) which define the contribution of each variable to a component:

pca$rotation
##               PC1        PC2          PC3           PC4
## Tmin   -0.5274767 -0.2337721  0.678194066 -4.551612e-01
## Tmean  -0.5271176 -0.2277391 -0.733096738 -3.644874e-01
## Tmax   -0.5320480 -0.2331100  0.051063234  8.123898e-01
## Precip  0.4010490 -0.9160487 -0.003811388  3.914667e-05

Principle Component Analysis

The eigenvalues are also stored in the prcomp object:

pca$sdev
## [1] 1.85194821 0.73096506 0.18965623 0.00290114

And can be easily translated into variance explained:

var_exp <- data.frame(pc = 1:4,
                      var_exp = pca$sdev^2 / sum(pca$sdev^2))

var_exp$var_exp_cumsum <- cumsum(var_exp$var_exp)

p <- ggplot(var_exp) +
  geom_bar(aes(x = pc, y = var_exp), stat = "identity") +
  geom_line(aes(x = pc, y = var_exp_cumsum))

Principle Component Analysis

Principle Component Analysis

We can now apply the PCA transformation to the whole raster stack:

clim_pca <- raster::predict(clim, pca, index = 1:4)

Principle Component Analysis

Principle Component Analysis

Principle Component Analysis

Principle Component Analysis

Now the magic!

Let's combine PCA and kmeans:

pca_values <- as.data.frame(clim_pca)
pca_kmeans <- kmeans(pca_values[, 1:3], centers = 5)
kmean_raster <- raster(clim_pca)
values(kmean_raster) <- pca_kmeans$cluster
plot(kmean_raster)

Now the magic!